How do social scientists collect data
to answer the questions?

PSCI 2270 - Week 3

Georgiy Syunyaev

Department of Political Science, Vanderbilt University

September 15, 2023

Plan for this week


  1. Learning about Population from Sample

  2. Descriptive Statistics

  3. Necessary Math

  4. Types of Data Collection

Plan for this week

  1. Learning about Population from Sample

Sampling Lingo:


  • We often cannot survey or measure outcomes among the whole set of units we are interested in \(\Rightarrow\) Target population

    • Example: People who will vote in the next election; All news reports from Fox News
  • We then have to resort to a subset of units that we can reasonably collect data for \(\Rightarrow\) Sample

    • Example: Those who participate in the survey; News reports that we can download online
  • We collect the sample from the available list that ideally includes the whole population \(\Rightarrow\) Sampling frame

    • Selection bias: list of registered voters (frame) might include nonvoters!

Learning about populations

  • Probability: formalize the uncertainty about how our data came to be
  • Inference: learning about the population from a set of data

Types of sampling


  • Probability sampling: Every unit in the population has a known probability of being selected into sample
  • Simple random sampling: Every unit has an equal selection probability

    • e.g. random digit dialing (RDD):

      1. Take a particular area code + exchange: 617-495-XXXX.
      2. Randomly choose each digit in XXXX to call a particular phone.
      3. Every phone number in America has an equal chance of being included in sample.
  • Quota sampling, cluster sampling, etc.
  • Non-probability sampling: e.g. Opt-in Internet panels

    • Question: Can you see a problem with this?
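A minimal Python sketch of the random digit dialing steps above. The function name `random_digit_dial`, the seed, and the choice to draw suffixes without replacement (so no number is called twice) are illustrative assumptions, not part of the original procedure.

```python
import random

def random_digit_dial(area_code, exchange, n_calls, seed=None):
    """Draw n_calls distinct phone numbers within one area code + exchange."""
    rng = random.Random(seed)
    # Each 4-digit suffix 0000-9999 is equally likely to be chosen,
    # so this is simple random sampling *within* this exchange.
    suffixes = rng.sample(range(10_000), n_calls)
    return [f"{area_code}-{exchange}-{s:04d}" for s in suffixes]

numbers = random_digit_dial("617", "495", n_calls=5, seed=2270)
print(numbers)
```

Each suffix is equally likely only within the chosen area code and exchange; covering a wider population would also require sampling the area codes and exchanges themselves.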

1936 Literary Digest Poll



  • Literary Digest predicted elections using mail-in polls

  • Source of addresses: automobile registrations, phone books, etc.

  • In 1936, sent out 10 million ballots, over 2.3 million returned

  • George Gallup sampled only 50,000 respondents from all voting age citizens

Poll’s Result: Fail!

Pollster          FDR’s Vote Share
Literary Digest   43%
George Gallup     56%
Actual Outcome    62%


  • Ballots skewed toward the wealthy (with cars, phones) \(\Rightarrow\) selection bias

    • Only 1 in 4 households had a phone in 1936.
  • People who respond could be different than those who don’t \(\Rightarrow\) nonresponse bias
  • Note: When selection procedure is biased, adding more observations doesn’t help!
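The point that a biased selection procedure is not rescued by sample size can be illustrated with a small simulation. This is only a sketch: the population size, the 25% share of wealthy voters, and the support rates below are invented for illustration and are not estimates of the 1936 electorate.

```python
import random

rng = random.Random(1936)

# Hypothetical population (all numbers are invented for illustration):
# 25% of voters are "wealthy" (own a car or phone); support for the
# incumbent is 40% among the wealthy and 68% among everyone else.
population = []
for _ in range(200_000):
    wealthy = rng.random() < 0.25
    supports = rng.random() < (0.40 if wealthy else 0.68)
    population.append((wealthy, supports))

def share(sample):
    """Fraction of a sample that supports the incumbent."""
    return sum(supports for _, supports in sample) / len(sample)

true_share = share(population)

# Biased frame: a very large sample, but drawn only from wealthy voters.
wealthy_only = [voter for voter in population if voter[0]]
biased_sample = rng.sample(wealthy_only, 20_000)

# Simple random sample: much smaller, but every voter is equally likely.
random_sample = rng.sample(population, 1_000)

print(f"true share:             {true_share:.3f}")
print(f"biased, n = 20,000:     {share(biased_sample):.3f}")
print(f"random, n = 1,000:      {share(random_sample):.3f}")
```

Under these assumptions the small random sample lands much closer to the true share than the large biased one, because extra observations from the wrong frame only estimate the wrong quantity more precisely.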

1948 Election

The Polling Disaster

Pollster         Truman   Dewey   Thurmond   Wallace
Crossley         45%      50%     2%         3%
Gallup           44%      50%     2%         4%
Roper            38%      53%     5%         4%
Actual Outcome   50%      45%     3%         2%


  • Quota sampling:

    • fixed quota of certain respondents for each interviewer
    • sample resembles the population on these characteristics
  • Potential unobserved confounding \(\Rightarrow\) selection bias

    • Republicans were easier to interview within quotas (phones, listed addresses, etc.)

Plan for this week


  1. Learning about Population from Sample
  2. Descriptive Statistics

What to DO with Measured Outcomes


  • A variable is a series of measurements about some concept
  • Descriptive (summary) statistics are numerical summaries of those measurements

    • If we were smart enough, we wouldn’t need them: we could just look at the list of numbers and completely understand it
  • Two salient features of a variable that we want to know:

    • Central tendency: where is the middle/typical/average value
    • Spread around the center: are all the data close to the center or spread out?

Center of the data: Mean

  • Center of the data: typical/average value
  • Mean: sum of the values divided by the number of observations

\[ \color{#98971a}{\bar{x}} = \color{#d65d0e}{\frac{1}{n}} \color{#458588}{\sum_{i = 1}^{n} x_{i}} \]

  • What’s all this notation?

    • Population value: Greek letters
    • Sample value: Latin letters
    • Miscellaneous squiggles (sums, hats, bars, subscripts)
  • Applied to the mean:

    • Population value: \(\mu\) (say “mu”)
    • Sample value: \(\hat{\mu}\) (say “mu-hat”); \(\bar{x}\) (say “x-bar”)

Center of the data: Median


  • Median: \[ \text{median} = \begin{cases} \text{middle value} & \text{if number of entries is odd} \\ \frac{\text{sum of two middle values}}{2} & \text{if number of entries is even} \end{cases} \]
  • Median more robust to outliers:

    • Example 1: \(\text{data} = \{ 0, 1, 2, 3, 5 \}\). \(\text{mean} = 2.2\), \(\text{median} = 2\)
    • Example 2: \(\text{data} = \{ 0, 1, 2, 3, 100 \}\). \(\text{mean} = 21.2\), \(\text{median} = 2\)
  • Question: What does Elon Musk do to the mean vs median income? \(\Rightarrow\) income inequality measure
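The two examples above can be checked with Python's standard `statistics` module. This is a small sketch; the variable names and the extra even-length list are illustrative additions.

```python
from statistics import mean, median

example_1 = [0, 1, 2, 3, 5]
example_2 = [0, 1, 2, 3, 100]   # same data, but with one extreme value
even_case = [0, 1, 2, 3]        # even number of entries

for data in (example_1, example_2):
    # The mean moves a lot when the largest value changes; the median does not.
    print(data, "mean =", mean(data), "median =", median(data))

# With an even number of entries, the median averages the two middle values.
print(even_case, "median =", median(even_case))
```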

Spread of the data

  • Are the data close to the center?
  • Range: \(\left[ \min (X), \max (X) \right]\)
  • Quantile (quartile, quintile, percentile, etc):

    • 25th percentile = lower quartile (25% of the data below this value)
    • 50th percentile = median (50% of the data below this value)
    • 75th percentile = upper quartile (75% of the data below this value)
  • Interquartile range (IQR): a measure of variability, equal to the upper quartile minus the lower quartile

    • How spread out is the middle half of the data?
    • Is most of the data really close to the median or are the values spread out?
  • One definition of outliers: more than 1.5 × IQR above the upper quartile or below the lower quartile
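A short sketch of these spread measures in Python, using a made-up list of ten values. Quartile conventions differ across software; the `method="inclusive"` option of `statistics.quantiles` used here is one reasonable choice, not the only one.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]   # hypothetical values

# "inclusive" treats the data as the whole population and
# interpolates linearly between neighboring data points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1   # spread of the middle half of the data

# The 1.5 x IQR rule for flagging outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("range:", (min(data), max(data)))
print("quartiles:", q1, q2, q3)
print("IQR:", iqr)
print("outliers:", outliers)
```

The range is driven entirely by the single value 50, while the IQR ignores it; that value is also the only one flagged by the 1.5 × IQR rule.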

Standard deviation

  • Standard deviation (\(\sigma\), sd): On average, how far away are data points from the mean?

\[ \text{sd} = \color{#cc241d}{\sqrt{\color{#b16286}{\frac{1}{n - 1}} \color{#98971a}{\sum_{i = 1}^{n}} \color{#458588}{(}\color{#d65d0e}{x_i - \bar{x}}\color{#458588}{)^2} }} \]

  • Steps:

    1. Subtract the mean from each data point
    2. Square each resulting difference
    3. Take the sum of these values
    4. Divide by \(n - 1\)
    5. Take the square root
  • Variance (\(\sigma^2\), var): \(\text{Var} = \text{standard deviation}^2\)
  • Question: Why not just take the average deviations from mean without squaring?
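The five steps for the standard deviation can be followed line by line in Python. This sketch reuses the data from the median example and compares the result against `statistics.stdev`, which also divides by \(n - 1\).

```python
import math
from statistics import stdev

data = [0, 1, 2, 3, 5]
n = len(data)
x_bar = sum(data) / n                        # sample mean

deviations = [x - x_bar for x in data]       # 1. subtract the mean
squared = [d ** 2 for d in deviations]       # 2. square each difference
total = sum(squared)                         # 3. sum the squares
variance = total / (n - 1)                   # 4. divide by n - 1
sd = math.sqrt(variance)                     # 5. take the square root

print("variance:", round(variance, 4))
print("sd by hand:", round(sd, 4))
print("sd from statistics.stdev:", round(stdev(data), 4))
```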

Plan for this week


  1. Learning about Population from Sample

  2. Descriptive Statistics

  3. Some Math

Some Building Blocks


  • Probability:

    • Basis for understanding uncertainty in our estimates
    • Statistics is applied probability
  • Law of Large Numbers

    • Perform the same task over and over (e.g., draw a sample)
    • Average of the results converges to the truth
  • Central Limit Theorem:

    • Add up a lot of independent factors
    • Result approximately follows the normal distribution

Large random samples


  • In real data, we will have a set of \(n\) measurements on a variable: \(X_1\), \(X_2\), …, \(X_n\)

    • \(X_1\) is the age of the first randomly selected registered voter.
    • \(X_2\) is the age of the second randomly selected registered voter, etc.
  • Empirical analyses: sums or means of these n measurements

    • All statistical procedures involve a statistic, very often sum or mean.
    • What are the properties of these sums and means?
    • Can the sample mean of age tell us anything about the population distribution of age?
  • Asymptotics: what can we learn as \(n\) gets big?

Stats Lingo: LLN


Law of Large Numbers (LLN)

Let \(X_1\), …, \(X_n\) be i.i.d. random variables with mean \(\mu\) and finite variance \(\sigma^2\). Then, \(\bar{X}_{n}\) converges to \(\mu\) as \(n\) gets large.


  • Intuition: The probability of \(\bar{X}_n\) being “far away” from \(\mu\) goes to \(0\) as \(n\) gets big
  • The distribution of sample mean “collapses” to population mean
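A small simulation can make the LLN concrete. The population here is an assumption for illustration: ages drawn uniformly from the integers 18 to 90, so the population mean is 54. The seed and sample sizes are likewise arbitrary.

```python
import random

rng = random.Random(2270)

# Hypothetical population: ages uniform on the integers 18..90,
# so the population mean is (18 + 90) / 2 = 54.
POPULATION_MEAN = 54

def sample_mean(n):
    """Mean age in a random sample of n independent draws."""
    return sum(rng.randint(18, 90) for _ in range(n)) / n

results = {n: sample_mean(n) for n in (10, 100, 10_000, 100_000)}
for n, x_bar in results.items():
    print(f"n = {n:>7}: sample mean = {x_bar:.2f}, "
          f"error = {abs(x_bar - POPULATION_MEAN):.2f}")
```

Any single small sample can still be far off; the LLN only says that large errors become increasingly unlikely as \(n\) grows.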

Normal Distribution

  • The normal distribution is the classic “bell-shaped” curve.

    • Extremely ubiquitous in statistics
    • Mean \(\mu\) and variance \(\sigma^2\) follow the standard notation
    • When \(X\) is distributed normally, we write \(X \sim N ( \mu, \sigma^2 )\)
  • Three key properties:

    • Unimodal: one peak at the mean
    • Symmetric around the mean
    • Density is everywhere positive: any real value can possibly occur

Stats Lingo: CLT


Central Limit Theorem (CLT)

Let \(X_1\), …, \(X_n\) be i.i.d. random variables with mean \(\mu\) and variance \(\sigma^2\). Then, \(\bar{X}_n\) will be approximately distributed \(N ( \mu, \sigma^2 / n )\) in large samples.


  • Approximation is better as \(n\) goes up \(\Rightarrow\) asymptotics

  • “Sample means tend to be normally distributed as samples get large.”

    • We now know how far away \(\bar{X}_n\) will be from its mean!
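The CLT can be checked by simulation as well. In this sketch the population is deliberately non-normal: an exponential distribution with mean 1 and sd 1 (an assumption chosen for illustration, as are the seed, the sample size of 100, and the 2,000 repetitions).

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(2270)

N = 100                 # size of each sample
REPS = 2_000            # number of repeated samples
MU, SIGMA = 1.0, 1.0    # mean and sd of an exponential(1) population

def one_sample_mean():
    """Mean of one sample of N draws from the skewed population."""
    return sum(rng.expovariate(1.0) for _ in range(N)) / N

sample_means = [one_sample_mean() for _ in range(REPS)]

# The CLT predicts sample means centered at MU with sd SIGMA / sqrt(N).
predicted_sd = SIGMA / math.sqrt(N)
share_within_2sd = sum(
    abs(m - MU) <= 2 * predicted_sd for m in sample_means
) / REPS

print(f"mean of sample means: {mean(sample_means):.3f} (predicted {MU})")
print(f"sd of sample means:   {stdev(sample_means):.3f} "
      f"(predicted {predicted_sd:.3f})")
print(f"share within 2 sd:    {share_within_2sd:.3f}")
```

Even though individual draws are strongly skewed, the sample means cluster symmetrically around the population mean, with roughly 95% of them within two predicted standard deviations.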

Implications of CLT/LLN


  • By CLT, sample mean \(\approx\) normal with mean \(\mu\) and sd of \(\sigma / \sqrt{n}\) (variance \(\sigma^2 / n\))
  • By empirical rule, sample mean will be within \(2 \times \sigma / \sqrt{n}\) of the population mean about 95% of the time
  • We usually only have 1 sample, so we’ll only get 1 sample mean. So why do we care about LLN/CLT?

    • CLT gives us assurances our sample mean won’t be too far from population mean
    • CLT will also help us create measure of uncertainty for our estimates, standard error (SE):

    \[ SE = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} \]
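A worked example of the standard error formula, using hypothetical numbers: suppose the population sd of age is \(\sigma = 20\) years, and compare samples of \(n = 100\) and \(n = 400\) voters.

\[ SE_{100} = \frac{20}{\sqrt{100}} = 2, \qquad SE_{400} = \frac{20}{\sqrt{400}} = 1 \]

Under these assumptions, quadrupling the sample size halves the standard error, so precision improves with \(\sqrt{n}\) rather than with \(n\).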

Plan for this week


  1. Learning about Population from Sample

  2. Descriptive Statistics

  3. Some Math

  4. Types of Data Collection

Data Collection Methods

  • Interview data: Data that are collected from responses to questions posed by the researcher to a respondent

    • Examples: Individual interviews on the street, focus groups, large surveys
  • Firsthand observation: Data that may be collected by making observations in a field study or in a laboratory setting

    • Examples: Playing games in the university laboratory, collecting Facebook/Twitter user data
  • Document analysis: Use of any audio, visual, or written materials as a source of data

    • Examples: Government statistics, geospatial data coding, archival work, databases of media reports, data collected by NGOs and private organizations

Interview Data Collection


  • Interview data can be collected

    • Face-to-face
    • By phone
    • Online via survey platform
  • Sample size:

    • Small-\(N\): usually face-to-face and can be semi-structured
    • Large-\(N\): Rarely conducted by the researchers themselves (Why?) and has to be structured

What to Consider



  • Context: Which mode is most common/appropriate among the study population?
  • Depth/Length: Is this a short factual survey or an in-depth interview?
  • Sensitivity: Do you ask questions that require trust between researcher and respondent?
  • Logistics: What is the easiest/cheapest way to reach the respondents?
  • Security/Privacy: Are there concerns about using a specific mode of interview?

Document Analysis


  • Document is understood very broadly

    • Any pre-existing surveys
    • Text, image, audio and video materials
    • Databases available online (Vanderbilt Databases)
  • Running record: Materials that are collected systematically across time

    • Examples: Statistical agencies, databases with indicators, systematic geographic data, etc.
  • Episodic records: Records produced in a casual, personal, and accidental manner

    • Example: Diaries, personal blogs, social media posts

What to Consider



  • Accessibility: Do you need to collect/process data from the data source?
  • Coverage: Are there systematic biases in the data available?
  • Interpretation: Is primary data/methodology available? Can you infer it?

Firsthand Observation


  • Usually conducted in the field (travel to location), in the lab (stay home!), or in the lab-in-the-field (🤔)
  • Direct observation: Observing the political behavior itself

    • Example: Laboratory observation of group interactions, observation of worker movement when working at the factory, browsing history of participants, etc.
  • Indirect observation: Observing physical traces of the political behavior

    • Example: Political protests in factories (political mobilization), improvements in services provided to community (collective action)

What to Consider



  • Research Question: Is pure observation of behavior enough? What about causal explanations?
  • Observer effect: Does your observation interfere with how subjects behave?
  • Safety: Does the political behavior you study pose danger to you or your enumerators?
  • Privacy/Anonymity: Does your research pose any risk of harm to participants, or could it implicate them?

Choosing between methods

  • Validity of the measure

  • Combine different collection strategies for reliability

  • Reactivity: Effect of the data collection on the phenomena being measured
  • Ethics: Risks of harm for your subjects (is your study worth it?)
  • Subjectivity: How much freedom do you have in interpreting the results?
  • Also do what you like (!)

Ethics

  • Primary concern is Beneficence: Does your study pose harm to the observed?
  • Key examples:

    • Negative repercussion from associating with the researcher because of researcher’s sponsors, nationality or status
    • Invasion of privacy
    • Stress during the research interaction
    • Disclosure of behavior or information that causes harm to the observed during or after the study
  • Also to consider:

    • Autonomy: ability to consent, withdraw, etc.
    • Justice: distribution of benefits and burdens

Everything is Possible (Almost)

  • Question: Which factors affect protest participation?

Answers:

  • Case study: Lohmann (1994)
  • Process-tracing: Pearlman (2013)
  • Laboratory experiment: Young (2019)
  • Social network analysis: Larson et al. (2019)
  • Using original surveys: Boulianne and Lee (2022)
  • Field experiments: Bursztyn et al. (2021)
  • Original data on protests: Steinert-Threlkeld (2017)

Next Week



  • Think about possible data strategies for answering the question: Which factors affect election participation?

  • Applying CLT/LLN to get point estimates and estimates of uncertainty

  • Comparing group means and logic of causal inference

References

Boulianne, Shelley, and Sangwon Lee. 2022. “Conspiracy Beliefs, Misinformation, Social Media Platforms, and Protest Participation.” Media and Communication 10 (4). https://doi.org/10.17645/mac.v10i4.5667.
Bursztyn, Leonardo, Davide Cantoni, David Y Yang, Noam Yuchtman, and Y Jane Zhang. 2021. “Persistent Political Engagement: Social Interactions and the Dynamics of Protest Movements.” American Economic Review: Insights 3 (2): 233–50.
Larson, Jennifer M., Jonathan Nagler, Jonathan Ronen, and Joshua A. Tucker. 2019. “Social Networks and Protest Participation: Evidence from 130 Million Twitter Users.” American Journal of Political Science 63 (3): 690–705. https://doi.org/10.1111/ajps.12436.
Lohmann, Susanne. 1994. “The Dynamics of Informational Cascades: The Monday Demonstrations in Leipzig, East Germany, 1989–91.” World Politics 47 (1): 42–101. https://doi.org/10.2307/2950679.
Pearlman, Wendy. 2013. “Emotions and the Microfoundations of the Arab Uprisings.” Perspectives on Politics 11 (2): 387–409. https://doi.org/10.1017/s1537592713001072.
Steinert-Threlkeld, Zachary C. 2017. “Spontaneous Collective Action: Peripheral Mobilization During the Arab Spring.” American Political Science Review 111 (2): 379–403. https://doi.org/10.1017/s0003055416000769.
Young, Lauren E. 2019. “The Psychology of State Repression: Fear and Dissent Decisions in Zimbabwe.” American Political Science Review 113 (1): 140–55. https://doi.org/10.1017/S000305541800076X.